1 Introduction: the relative entropy as an epistemological functional

Shannon's Information Theory (IT) (1948) definitively established the purely
mathematical nature of entropy and relative entropy, in contrast to the
previous identification by Boltzmann (1872) of his "H-functional" as the
physical entropy of earlier thermodynamicists (Carnot, Clausius, Kelvin). The
following declaration is attributed to Shannon (Tribus and McIrvine 1971):

My greatest concern was what to call it. I thought of calling it
"information", but the word was overly used, so I decided to call it
"uncertainty". When I discussed it with John von Neumann, he had a better
idea. Von Neumann told me, "You should call it entropy, for two reasons. In
the first place your uncertainty function has been used in statistical
mechanics under that name, so it already has a name. In the second place, and
more important, nobody knows what entropy really is, so in a debate you
will always have the advantage."

In IT, the entropy of a message limits its minimum coding length, in
the same way that, more generally, the complexity of the message determines
its compressibility in the Kolmogorov-Chaitin-Solomonoff algorithmic
information theory (see e.g. Li and Vitanyi (1997)).

Besides coding and compressibility interpretations, the relative entropy
also turns out to possess a direct probabilistic meaning, as demonstrated
by the asymptotic rate formula (4). This circumstance enables a complete
exposition of classical inferential statistics (hypothesis testing, maximum
likelihood, maximum entropy, exponential and log-linear models, EM
algorithm, etc.) under the guise of a discussion of the properties of the
relative entropy.

In a nutshell, the relative entropy $K(f\|g)$ has two arguments $f$ and
$g$, which both are probability distributions belonging to the same simplex.
Although formally similar, the arguments are epistemologically contrasted:
$f$ represents the observations, the data, what we see, while $g$ represents
the expectations, the models, what we believe. $K(f\|g)$ is an asymmetrical
measure of dissimilarity between empirical and theoretical distributions, able
to capture the various aspects of the confrontation between models and
data, that is the art of classical statistical inference, including Popper's
refutationism as a particular case. Here lies the dialectic charm of $K(f\|g)$,
which emerges in that respect as an epistemological functional.

We have here attempted to emphasize and synthesize the conceptual
significance of the theory, rather than insisting on its mathematical rigor,
the latter being thoroughly developed in a broad and widely available
literature (see e.g. Cover and Thomas (1991) and references therein). Most of
the illustrations bear on independent and identically distributed (i.i.d.)
finitely valued observations, that is on dice models. This convenient
restriction is not really limiting, and can be extended to Markov chains of
finite order, as illustrated in the last part on textual data with presumably
original applications, such as heating and cooling texts, or additive and
multiplicative text mixtures.

2 The asymptotic rate formula 
2.1 Model and empirical distributions 

$D = (x_1 x_2 \ldots x_n)$ denotes the data, consisting of $n$ observations, and $M$
denotes a possible model for those data. The corresponding probability is
$P(D|M)$, with

    $\sum_D P(D|M) = 1$.

Assume (dice models) that each observation can take on $m$ discrete values,
each observation $X$ being i.i.d. distributed as

    $f_j^M := P(X = j)$,    $j = 1, \ldots, m$.
Figure 1: The simplex $S_3$, where $f^U = (1/3, 1/3, 1/3)$ denotes the uniform
distribution. In the interior of $S_m$, a distribution $f$ can be varied along
$m - 1$ independent directions, that is $\dim(S_m) = m - 1$.

$f^M$ is the model distribution. The empirical distribution, also called type
(Csiszar and Korner 1980) in the IT framework, is

    $f_j^D := \frac{n_j}{n}$,    $j = 1, \ldots, m$

where $n_j$ counts the occurrences of the $j$-th category and $n = \sum_{j=1}^m n_j$ is the
sample size.

Both $f^M$ and $f^D$ are discrete distributions with $m$ modalities. Their
collection forms the simplex $S_m$ (figure 1)

    $S \equiv S_m := \{ f \mid f_j \ge 0 \text{ and } \sum_{j=1}^m f_j = 1 \}$.

2.2 Entropy and relative entropy: definitions and properties 

Let $f, g \in S_m$. The entropy $H(f)$ of $f$ and the relative entropy $K(f\|g)$
between $f$ and $g$ are defined (in nats) as

    $H(f) := -\sum_{j=1}^m f_j \ln f_j$ = entropy of $f$

    $K(f\|g) := \sum_{j=1}^m f_j \ln \frac{f_j}{g_j}$ = relative entropy of $f$ with respect to $g$.

$H(f)$ is concave in $f$, and constitutes a measure of the uncertainty of the
outcome among $m$ possible outcomes (proofs are standard):

    $0 \le H(f) \le \ln m$

where

• $H(f) = 0$ iff $f$ is a deterministic distribution concentrated on a single
modality (minimum uncertainty)

• $H(f) = \ln m$ iff $f$ is the uniform distribution (of the form $f_j = 1/m$)
(maximum uncertainty).

$K(f\|g)$, also known as the Kullback-Leibler divergence, is convex in both
arguments, and constitutes a non-symmetric measure of the dissimilarity
between the distributions $f$ and $g$, with

    $0 \le K(f\|g) \le \infty$

where

• $K(f\|g) = 0$ iff $f = g$

• $K(f\|g) < \infty$ iff $f$ is absolutely continuous with respect to $g$, that is
if $g_j = 0$ implies $f_j = 0$.

Let the categories $j = 1, \ldots, m$ be coarse-grained, that is aggregated into
groups of super-categories $J = 1, \ldots, M < m$. Define

    $F_J := \sum_{j \in J} f_j$    $G_J := \sum_{j \in J} g_j$.

Then

    $H(F) \le H(f)$    $K(F\|G) \le K(f\|g)$.    (1)
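The definitions and the coarse-graining inequalities (1) can be checked numerically. The following is a minimal sketch; the function names, the example distributions and the grouping are illustrative assumptions, not part of the original text.

```python
import math

def entropy(f):
    """H(f) = -sum_j f_j ln f_j in nats, with the convention 0 ln 0 = 0."""
    return -sum(p * math.log(p) for p in f if p > 0)

def relative_entropy(f, g):
    """K(f||g) = sum_j f_j ln(f_j / g_j); infinite if some g_j = 0 while f_j > 0."""
    total = 0.0
    for p, q in zip(f, g):
        if p > 0:
            if q == 0:
                return math.inf
            total += p * math.log(p / q)
    return total

def coarse_grain(f, groups):
    """Aggregate f over groups of category indices: F_J = sum_{j in J} f_j."""
    return [sum(f[j] for j in group) for group in groups]

# Hypothetical example with m = 4 categories merged into M = 2 super-categories.
f = [0.1, 0.2, 0.3, 0.4]
g = [0.25, 0.25, 0.25, 0.25]
groups = [[0, 1], [2, 3]]
F, G = coarse_grain(f, groups), coarse_grain(g, groups)
```

With these inputs, `entropy(F) <= entropy(f)` and `relative_entropy(F, G) <= relative_entropy(f, g)` both hold, as (1) requires.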



2.3 Derivation of the asymptotic rate (i.i.d. models) 



On one hand, straightforward algebra yields

    $P(D|f^M) := P(D|M) = P(x_1 x_2 \ldots x_n | M) = \prod_{j=1}^m (f_j^M)^{n_j}$
    $= \exp[-n K(f^D\|f^M) - n H(f^D)]$    (2)
On the other hand, each permutation of the data $D = (x_1, \ldots, x_n)$ yields
the same $f^D$. Stirling's approximation $n! \cong n^n \exp(-n)$ (where $a_n \cong b_n$
means $\lim_{n\to\infty} \frac{1}{n} \ln(a_n/b_n) = 0$) shows that

    $P(f^D|M) = \frac{n!}{n_1! \cdots n_m!} P(D|M) \cong \exp(n H(f^D)) \, P(D|M)$.    (3)

(2) and (3) imply the asymptotic rate formula:

    $P(f^D|f^M) \cong \exp(-n K(f^D\|f^M))$    asymptotic rate formula.    (4)

Hence, $K(f^D\|f^M)$ is the asymptotic rate of the quantity $P(f^D|f^M)$, the
probability of the empirical distribution $f^D$ for a given model $f^M$, or
equivalently the likelihood of the model $f^M$ for the data $f^D$. Without
additional constraints, the model $f^M$ maximizing the likelihood is simply
$f^M = f^D$ (section 3). Also, without further information, the most probable
empirical distribution $f^D$ is simply $f^D = f^M$ (see the section on maximum
entropy).
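The rate formula (4) can be illustrated by comparing the exact multinomial probability of a type with $\exp(-nK)$ as $n$ grows. This is only a sketch under assumed inputs (a fair coin as model and a fixed empirical distribution); the helper names are hypothetical.

```python
import math

def relative_entropy(f, g):
    """K(f||g) for distributions with g_j > 0 wherever f_j > 0."""
    return sum(p * math.log(p / q) for p, q in zip(f, g) if p > 0)

def log_prob_of_type(counts, model):
    """Exact ln P(f^D | f^M): ln n! - sum_j ln n_j! + sum_j n_j ln f_j^M."""
    n = sum(counts)
    log_multinomial = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    return log_multinomial + sum(c * math.log(q) for c, q in zip(counts, model) if c > 0)

model = [0.5, 0.5]        # f^M (illustrative choice)
empirical = [0.7, 0.3]    # f^D, held fixed while n grows
K = relative_entropy(empirical, model)

# -(1/n) ln P(f^D | f^M) should approach K as n increases.
rates = {}
for n in (10, 100, 1000, 10000):
    counts = [round(n * p) for p in empirical]
    rates[n] = -log_prob_of_type(counts, model) / n
```

The gap between `rates[n]` and `K` shrinks roughly like $\ln(n)/n$, which is the sub-exponential correction that the relation $\cong$ ignores.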

2.4 Asymmetry of the relative entropy and hard falsificationism

$K(f\|g)$ as a dissimilarity measure between $f$ and $g$ is proper (that is
$K(f\|g) = 0$ implies $f = g$) but not symmetric ($K(f\|g) \ne K(g\|f)$ in general).
Symmetrized dissimilarities such as $J(f\|g) := \frac12 (K(f\|g) + K(g\|f))$ or
$L(f\|g) := K(f\|\frac12 (f+g)) + K(g\|\frac12 (f+g))$ have often been proposed in
the literature.

The conceptual significance of such functionals can indeed be questioned:
from equation (4), the first argument $f$ of $K(f\|g)$ should be an empirical
distribution, and the second argument $g$ a model distribution. Furthermore,
the asymmetry of the relative entropy does not constitute a defect, but
perfectly matches the asymmetry between data and models. Indeed

• if $f_j^M = 0$ and $f_j^D > 0$, then $K(f^D\|f^M) = \infty$ and, from (4),
$P(f^D|f^M) = 0$ and, unless the veracity of the data $f^D$ is questioned, the
model distribution $f^M$ should be strictly rejected

• if on the contrary $f_j^M > 0$ and $f_j^D = 0$, then $K(f^D\|f^M) < \infty$ and
$P(f^D|f^M) > 0$ in general, and $f^M$ should not be rejected, at least for
small samples.

Thus the theory "All crows are black" is refuted by the single observation
of a white crow, while the theory "Some crows are black" is not refuted by the
observation of a thousand white crows. In this spirit, Popper's falsificationist
mechanisms (Popper 1963) are captured by the properties of the relative
entropy, and can be further extended to probabilistic or "soft falsificationist"
situations, beyond the purely logical true/false context (see section 3).

2.5 The chi-square approximation 

Most of the properties of the relative entropy are shared by another
functional, historically anterior and well-known to statisticians, namely the
chi-square $\chi^2(f\|g) := n \sum_j (f_j - g_j)^2 / g_j$. As a matter of fact, the
relative entropy and the chi-square (divided by $2n$) are identical up to the
third order:

    $K(f\|g) = \frac12 \sum_j \frac{(f_j - g_j)^2}{g_j} + O\left(\sum_j \frac{(f_j - g_j)^3}{g_j^2}\right) = \frac{1}{2n} \chi^2(f\|g) + O(\|f - g\|^3)$    (5)

2.5.1 Example: coin (m = 2) 

The values of the relative entropy and the chi-square read, for various $f^M$
and $f^D$, as:

          $f^M$          $f^D$           $K(f^D\|f^M)$    $\chi^2(f^D\|f^M)/2n$
    a)    (0.5,0.5)      (0.5,0.5)       0                0
    b)    (0.5,0.5)      (0.7,0.3)       0.0823           0.08
    c)    (0.7,0.3)      (0.5,0.5)       0.0872           0.095
    d)    (0.7,0.3)      (0.7,0.3)       0                0
    e)    (0.5,0.5)      (1,0)           0.69             0.5
    f)    (1,0)          (0.99,0.01)     $\infty$         $\infty$
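The coin table can be recomputed directly from the definitions of $K$ and of the chi-square divided by $2n$. The sketch below assumes the row layout (model first, data second); the dictionary `cases` and the helper names are illustrative.

```python
import math

def relative_entropy(f, g):
    """K(f||g), infinite when f puts mass where g has none."""
    total = 0.0
    for p, q in zip(f, g):
        if p > 0:
            if q == 0:
                return math.inf
            total += p * math.log(p / q)
    return total

def half_chi_square(f, g):
    """chi^2(f||g) / 2n = (1/2) sum_j (f_j - g_j)^2 / g_j."""
    total = 0.0
    for p, q in zip(f, g):
        if q == 0:
            if p > 0:
                return math.inf
            continue
        total += (p - q) ** 2 / q
    return 0.5 * total

cases = {  # label: (f^M, f^D)
    "a": ((0.5, 0.5), (0.5, 0.5)),
    "b": ((0.5, 0.5), (0.7, 0.3)),
    "c": ((0.7, 0.3), (0.5, 0.5)),
    "d": ((0.7, 0.3), (0.7, 0.3)),
    "e": ((0.5, 0.5), (1.0, 0.0)),
    "f": ((1.0, 0.0), (0.99, 0.01)),
}
# Data is the first argument, model the second, as in K(f^D || f^M).
table = {k: (relative_entropy(fd, fm), half_chi_square(fd, fm))
         for k, (fm, fd) in cases.items()}
```

Rows b) and c) show the two functionals agreeing closely for nearby distributions, while row e) shows them diverging far from the model.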



3 Maximum likelihood and hypothesis testing 
3.1 Testing a single hypothesis (Fisher) 

As shown by (4), the higher $K(f^D\|f^M)$, the lower the likelihood $P(f^D|f^M)$.
This circumstance makes it possible to test the single hypothesis $H_0$: "the
model distribution is $f^M$". If $H_0$ were true, $f^D$ should fluctuate around its
expected value $f^M$, and fluctuations of too large amplitude, with occurrence
probability less than $\alpha$ (the significance level), should lead to the
rejection of $f^M$. Well-known results on the chi-square distribution (see e.g.
Cramer (1946) or Saporta (1990)) together with approximation (5) show
$2nK(f^D\|f^M)$ to be distributed, under $H_0$ and for $n$ large, as $\chi^2[df]$ with
$df = \dim(S_m) = m - 1$ degrees of freedom.

Therefore, the test consists in rejecting $H_0$ at level $\alpha$ if

    $2n K(f^D\|f^M) \ge \chi^2_{1-\alpha}[m-1]$.    (6)

In that respect, Fisher's classical hypothesis testing appears as a soft
falsificationist strategy, yielding the rejection of a theory $f^M$ for large
values of $K(f^D\|f^M)$. It generalizes Popper's (hard) falsificationism which is
limited to situations of strict refutation as expressed by $K(f^D\|f^M) = \infty$.
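Test (6) can be sketched in a few lines. To stay within the standard library, the chi-square quantile is passed in by the caller; the value 11.07 used below is the usual tabulated 0.95 quantile for 5 degrees of freedom, and the die counts are invented for illustration.

```python
import math

def g_statistic(counts, model):
    """2 n K(f^D || f^M) = 2 sum_j n_j ln(n_j / (n f_j^M))."""
    n = sum(counts)
    stat = 0.0
    for c, q in zip(counts, model):
        if c > 0:
            if q == 0:
                return math.inf  # hard falsification: observed an "impossible" category
            stat += c * math.log(c / (n * q))
    return 2.0 * stat

def reject_h0(counts, model, critical_value):
    """Reject H0 when the statistic reaches the chi-square quantile, as in (6)."""
    return g_statistic(counts, model) >= critical_value

CHI2_95_DF5 = 11.07              # tabulated quantile, supplied by hand (assumption)
fair_die = [1 / 6] * 6           # f^M
observed = [10, 15, 20, 20, 25, 30]  # hypothetical counts, n = 120
statistic = g_statistic(observed, fair_die)
```

Here `statistic` is close to 12.99, so the fair-die hypothesis would be rejected at the 5% level. A count falling on a category of zero model probability makes the statistic infinite, which is the hard-falsification limit.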

3.2 Testing a family of models 

Very often, the hypothesis to be tested is composite, that is of the form $H_0$:
"$f^M \in \mathcal{M}$", where $\mathcal{M} \subset S = S_m$ constitutes a family of models containing
a number $\dim(\mathcal{M})$ of free, non-redundant parameters.

If the observed distribution itself satisfies $f^D \in \mathcal{M}$, then there is
obviously no reason to reject $H_0$. But $f^D \notin \mathcal{M}$ in general, and hence

    $\min_{f \in \mathcal{M}} K(f^D\|f) = K(f^D\|\hat f^M)$ is strictly positive, with $\hat f^M := \arg\min_{f \in \mathcal{M}} K(f^D\|f)$.

$\hat f^M$ is known as the maximum likelihood estimate of the model, and depends
on both $f^D$ and $\mathcal{M}$. We assume $\hat f^M$ to be unique, which is e.g. the case if
$\mathcal{M}$ is convex.

If $f^M \in \mathcal{M}$, $2nK(f^D\|\hat f^M)$ follows a chi-square distribution with
$\dim(S) - \dim(\mathcal{M})$ degrees of freedom. Hence, one rejects $H_0$ at level $\alpha$ if

    $2n K(f^D\|\hat f^M) \ge \chi^2_{1-\alpha}[\dim(S) - \dim(\mathcal{M})]$.    (7)

If $\mathcal{M}$ reduces to a unique distribution $f^M$, then $\dim(\mathcal{M}) = 0$ and (7) reduces
to (6). In the opposite direction, $\mathcal{M} = S$ defines the saturated model, in
which case (7) yields the undefined inequality $0 \ge \chi^2_{1-\alpha}[0]$.

3.2.1 Example: coarse grained model specifications 

Let $f^M$ be a dice model, with categories $j = 1, \ldots, m$. Let $J = 1, \ldots, M < m$
denote groups of categories, and suppose that the model specifications
are coarse-grained (see (1)), that is

    $\mathcal{M} = \{ f^M \mid \sum_{j \in J} f_j^M = F_J^M, \; J = 1, \ldots, M \}$.

Let $J(j)$ denote the group to which $j$ belongs. Then the maximum likelihood
(ML) estimate is simply

    $\hat f_j^M = f_j^D \, \frac{F^M_{J(j)}}{F^D_{J(j)}}$ where $F_J^D := \sum_{j \in J} f_j^D$ and $K(f^D\|\hat f^M) = K(F^D\|F^M)$.    (8)
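A short numerical check of (8): the estimate rescales the empirical distribution within each group so that the group totals match the model specification, and the resulting relative entropy equals the one computed on the aggregated distributions. The data, grouping and group totals below are assumed values for illustration.

```python
import math

def relative_entropy(f, g):
    """K(f||g) for distributions with g_j > 0 wherever f_j > 0."""
    return sum(p * math.log(p / q) for p, q in zip(f, g) if p > 0)

def coarse_grained_ml(f_data, groups, F_model):
    """ML estimate under coarse-grained specifications:
    f_j = f_j^D * F^M_J(j) / F^D_J(j), assuming every F^D_J is positive."""
    estimate = [0.0] * len(f_data)
    for J, group in enumerate(groups):
        F_data_J = sum(f_data[j] for j in group)
        for j in group:
            estimate[j] = f_data[j] * F_model[J] / F_data_J
    return estimate

f_data = [0.1, 0.2, 0.3, 0.4]      # hypothetical f^D
groups = [[0, 1], [2, 3]]          # hypothetical grouping J(j)
F_model = [0.5, 0.5]               # specified group totals F^M
F_data = [sum(f_data[j] for j in group) for group in groups]

f_hat = coarse_grained_ml(f_data, groups, F_model)
lhs = relative_entropy(f_data, f_hat)     # K(f^D || f_hat)
rhs = relative_entropy(F_data, F_model)   # K(F^D || F^M)
```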
3.2.2 Example: independence

Let $X$ and $Y$ be two categorical variables with modalities $j = 1, \ldots, m_1$ and
$k = 1, \ldots, m_2$. Let $f_{jk}$ denote the joint distribution of $(X, Y)$. The
distribution of $X$ alone (respectively $Y$ alone) obtains as the marginal
$f_{j\bullet} := \sum_k f_{jk}$ (respectively $f_{\bullet k} := \sum_j f_{jk}$). Let $\mathcal{M}$ denote the set of
independent distributions, i.e.

    $\mathcal{M} = \{ f \in S \mid f_{jk} = a_j b_k \}$.

The corresponding ML estimate $\hat f^M \in \mathcal{M}$ is

    $\hat f^M_{jk} = f^D_{j\bullet} \, f^D_{\bullet k}$ where $f^D_{j\bullet} := \sum_k f^D_{jk}$ and $f^D_{\bullet k} := \sum_j f^D_{jk}$

with the well-known property (where $H_D(\cdot)$ denotes the entropy associated
to the empirical distribution)

    $K(f^D\|\hat f^M) = H_D(X) + H_D(Y) - H_D(X,Y) = \frac12 \sum_{jk} \frac{(f^D_{jk} - \hat f^M_{jk})^2}{\hat f^M_{jk}} + O(\|f^D - \hat f^M\|^3)$.    (9)

The mutual information $I(X : Y) := H_D(X) + H_D(Y) - H_D(X,Y)$ is the
information-theoretical measure of dependence between $X$ and $Y$. Inequality
$H_D(X,Y) \le H_D(X) + H_D(Y)$ ensures its non-negativity. By (5), the
corresponding test reduces to the usual chi-square test of independence,
with $\dim(S) - \dim(\mathcal{M}) = (m_1 m_2 - 1) - (m_1 + m_2 - 2) = (m_1 - 1)(m_2 - 1)$
degrees of freedom.
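The identity between the mutual information and the relative entropy to the product of the marginals can be verified on a small contingency table. The counts below are invented, and the function names are assumptions of this sketch.

```python
import math

def entropy(dist):
    """Entropy in nats, with 0 ln 0 = 0."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def independence_summary(counts):
    """Return (I(X:Y), K(f^D || product of marginals)) for a table of counts."""
    n = sum(sum(row) for row in counts)
    joint = [[c / n for c in row] for row in counts]
    row_marginal = [sum(row) for row in joint]            # distribution of X
    col_marginal = [sum(col) for col in zip(*joint)]      # distribution of Y
    mutual_info = (entropy(row_marginal) + entropy(col_marginal)
                   - entropy([p for row in joint for p in row]))
    divergence = sum(
        p * math.log(p / (row_marginal[j] * col_marginal[k]))
        for j, row in enumerate(joint) for k, p in enumerate(row) if p > 0
    )
    return mutual_info, divergence

dependent = [[30, 10], [10, 30]]     # hypothetical counts with association
independent = [[10, 20], [20, 40]]   # rows proportional: exact independence

I_dep, K_dep = independence_summary(dependent)
I_ind, K_ind = independence_summary(independent)
```

Multiplying `K_dep` by $2n$ gives the statistic to compare with the chi-square quantile with $(m_1 - 1)(m_2 - 1)$ degrees of freedom.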

3.3 Testing between two hypotheses (Neyman-Pearson) 

Consider the two hypotheses $H_0$: "$f^M = f^0$" and $H_1$: "$f^M = f^1$", where
$f^0$ and $f^1$ constitute two distinct distributions in $S$. Let $W \subset S$ denote the
rejection region for $f^0$, that is such that $H_1$ is accepted if $f^D \in W$, and $H_0$
is accepted if $f^D \in W^c := S \setminus W$. The errors of first, respectively second
kind are

    $\alpha := P(f^D \in W \mid f^0)$    $\beta := P(f^D \in W^c \mid f^1)$.

For $n$ large, Sanov's theorem (18) below shows that

    $\alpha \cong \exp(-n K(\tilde f^0\|f^0))$    $\tilde f^0 := \arg\min_{f \in W} K(f\|f^0)$    (10)

    $\beta \cong \exp(-n K(\tilde f^1\|f^1))$    $\tilde f^1 := \arg\min_{f \in W^c} K(f\|f^1)$.
The rejection region $W$ is said to be optimal if there is no other region
$W' \subset S$ with $\alpha(W') < \alpha(W)$ and $\beta(W') < \beta(W)$. The celebrated
Neyman-Pearson lemma, together with the asymptotic rate formula (4), states that
$W$ is optimal iff it is of the form

    $W = \{ f \mid \frac{P(f|f^1)}{P(f|f^0)} \ge T \} = \{ f \mid K(f\|f^0) - K(f\|f^1) \ge \frac{1}{n} \ln T := \tau \}$    (11)

One can demonstrate (see e.g. Cover and Thomas (1991) p. 309) that the
distributions (10) governing the asymptotic error rates coincide when $W$ is
optimal, and are given by the multiplicative mixture

    $\tilde f_j^0 = \tilde f_j^1 = f_j(\mu) := \frac{(f_j^0)^\mu (f_j^1)^{1-\mu}}{\sum_k (f_k^0)^\mu (f_k^1)^{1-\mu}}$    (12)

where $\mu$ is the value ensuring $K(f(\mu)\|f^0) - K(f(\mu)\|f^1) = \tau$. Finally, the
overall probability of error, that is the probability of occurrence of an error
of first or second kind, is minimum for $\tau = 0$, with rate equal to

    $K(f(\mu^*)\|f^0) = K(f(\mu^*)\|f^1) = -\min_{0 \le \mu \le 1} \ln\left(\sum_k (f_k^0)^\mu (f_k^1)^{1-\mu}\right) =: C(f^0, f^1)$

where $\mu^*$ is the value minimising the third term. The quantity $C(f^0, f^1) \ge 0$,
known as Chernoff information, constitutes a symmetric dissimilarity
between the distributions $f^0$ and $f^1$, and measures how easily $f^0$ and $f^1$ can
be discriminated from each other. In particular, $C(f^0, f^1) = 0$ iff $f^0 = f^1$.

Example 2.5.1 continued: coins

Let $f := (0.5, 0.5)$, $g := (0.7, 0.3)$, $h := (0.9, 0.1)$ and $r := (1, 0)$. Numerical
estimates yield (in nats) $C(f,g) = 0.02$, $C(f,h) = 0.11$, $C(g,h) = 0.03$ and
$C(f,r) = \ln 2 = 0.69$.
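These Chernoff informations can be estimated by a direct search over $\mu$. The sketch below uses a plain grid on the open interval $(0, 1)$, an assumption made for simplicity (any one-dimensional minimiser would do); the endpoints are excluded so that boundary cases such as $r = (1, 0)$ are approached as a limit.

```python
import math

def chernoff_information(f0, f1, steps=10000):
    """C(f0, f1) = -min over mu in (0,1) of ln sum_k f0_k^mu f1_k^(1-mu),
    approximated on a regular grid of `steps` - 1 interior points."""
    best = math.inf
    for i in range(1, steps):
        mu = i / steps
        z = sum((p ** mu) * (q ** (1.0 - mu)) for p, q in zip(f0, f1))
        best = min(best, math.log(z))
    return -best

f = (0.5, 0.5)
g = (0.7, 0.3)
h = (0.9, 0.1)
r = (1.0, 0.0)

C_fg = chernoff_information(f, g)
C_fh = chernoff_information(f, h)
C_gh = chernoff_information(g, h)
C_fr = chernoff_information(f, r)
```

Rounded to two decimals, the four values agree with those quoted in the example.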



3.4 Testing a family within another 

Let $\mathcal{M}_0$ and $\mathcal{M}_1$ be two families of models, with $\mathcal{M}_0 \subset \mathcal{M}_1$ and
$\dim(\mathcal{M}_0) < \dim(\mathcal{M}_1)$. Consider the test of $H_0$ within $H_1$, opposing
$H_0$: "$f^M \in \mathcal{M}_0$" against $H_1$: "$f^M \in \mathcal{M}_1$".

By construction, $K(f^D\|\hat f^{M_0}) \ge K(f^D\|\hat f^{M_1})$ since $\mathcal{M}_1$ is a more
general model than $\mathcal{M}_0$. Under $H_1$, their difference can be shown to follow
asymptotically a chi-square distribution. Precisely, the nested test of $H_0$
within $H_1$ reads: "under the assumption that $H_1$ holds, reject $H_0$ if

    $2n [K(f^D\|\hat f^{M_0}) - K(f^D\|\hat f^{M_1})] \ge \chi^2_{1-\alpha}[\dim(\mathcal{M}_1) - \dim(\mathcal{M}_0)]$".    (13)
3.4.1 Example: quasi-symmetry, symmetry and marginal homogeneity

Flows can be represented by a square matrix $f_{jk} \ge 0$ such that
$\sum_{j=1}^m \sum_{k=1}^m f_{jk} = 1$, with the representation "$f_{jk}$ = proportion of units
located at place $j$ at some time and at place $k$ some fixed time later".

A popular model for flows is the quasi-symmetric class QS (Caussinus
1966), known as the Gravity model in Geography (Bavaud 2002a)

    $QS = \{ f \mid f_{jk} = \alpha_j \beta_k \gamma_{jk} \text{ with } \gamma_{jk} = \gamma_{kj} \}$

where $\alpha_j$ quantifies the "push effect", $\beta_k$ the "pull effect" and $\gamma_{jk}$ the
"distance deterrence function".

Symmetric and marginally homogeneous models constitute two popular
alternative families, defined as

    $S = \{ f \mid f_{jk} = f_{kj} \}$    $MH = \{ f \mid f_{j\bullet} = f_{\bullet j} \}$.

Symmetric and quasi-symmetric ML estimates satisfy (see e.g. Bishop et
al. (1975) or Bavaud (2002a))

    $\hat f^S_{jk} = \frac12 (f^D_{jk} + f^D_{kj})$    $\hat f^{QS}_{jk} + \hat f^{QS}_{kj} = f^D_{jk} + f^D_{kj}$    $\hat f^{QS}_{j\bullet} = f^D_{j\bullet}$    $\hat f^{QS}_{\bullet k} = f^D_{\bullet k}$

from which the values of $\hat f^{QS}$ can be obtained iteratively. A similar yet more
involved procedure yields the marginally homogeneous estimates $\hat f^{MH}$.

By construction, $S \subset QS$, and the test (13) consists in rejecting S (under
the assumption that QS holds) if

    $2n [K(f^D\|\hat f^S) - K(f^D\|\hat f^{QS})] \ge \chi^2_{1-\alpha}[m-1]$.    (14)

Noting that $S = QS \cap MH$, (14) actually constitutes an alternative testing
procedure for MH, avoiding the necessity of computing $\hat f^{MH}$ (Caussinus 1966).

Example 3.4.1 continued: inter-regional migrations

Relative entropies associated to Swiss inter-regional migration flows 1985-
1990 ($m = 26$ cantons; see Bavaud (2002a)) are $K(f^D\|\hat f^S) = .00115$ (with
$df = 325$) and $K(f^D\|\hat f^{QS}) = .00044$ (with $df = 300$). The difference is .00071
(with $df = 25$ only) and indicates that flow asymmetry is mainly produced
by the violation of marginal homogeneity (unbalanced flows) rather than
the violation of quasi-symmetry. However, the sheer size of the sample
($n$ of the order of six million) leads, at conventional significance levels, to
reject all three models S, MH and QS.



3.5 Competition between simple hypotheses: Bayesian selection

Consider the set of $q$ simple hypotheses "$H_a$: $f^M = g^a$", where $g^a \in S_m$
for $a = 1, \ldots, q$. In a Bayesian setting, denote by $P(H_a) = P(g^a) > 0$ the
prior probability of hypothesis $H_a$, with $\sum_{a=1}^q P(H_a) = 1$. The posterior
probability $P(H_a|D)$ obtains from Bayes rule as

    $P(H_a|D) = \frac{P(H_a) \, P(D|H_a)}{P(D)}$ with $P(D) = \sum_{a=1}^q P(H_a) \, P(D|H_a)$.

Direct application of the asymptotic rate formula (4) then yields

    $P(g^a|f^D) \cong \frac{P(g^a) \exp(-n K(f^D\|g^a))}{\sum_b P(g^b) \exp(-n K(f^D\|g^b))}$    (Bayesian hypothesis selection formula)    (15)

which shows, for $n \to \infty$, the posterior probability to be concentrated on
the (supposedly unique) solution of

    $\hat g = \arg\min_{g^a} K(f^*\|g^a)$ where $f^* := \lim_{n\to\infty} f^D$.

In other words, the asymptotically surviving model $g^a$ minimises the relative
entropy $K(f^*\|g^a)$ with respect to the long-run empirical distribution $f^*$, in
accordance with the ML principle.

For finite $n$, the relevant functional is $K(f^D\|g^a) - \frac{1}{n} \ln P(g^a)$, where the
second term represents a prior penalty attached to hypothesis $H_a$. Attempts
to generalize this framework to families of models $\mathcal{M}_a$ ($a = 1, \ldots, q$) lie at
the heart of the so-called model selection procedures, with the introduction
of penalties (as in the AIC, BIC, DIC, ICOMP, etc. approaches) increasing
with the number of free parameters $\dim(\mathcal{M}_a)$ (see e.g. Robert (2001)).
In the alternative minimum description length (MDL) and algorithmic
complexity theory approaches (see e.g. MacKay (2003) or Li and Vitanyi (1997)),
richer models necessitate a longer description and should be penalised
accordingly. All those procedures, together with Vapnik's Structural Risk
Minimization (SRM) principle (1995), aim at controlling the problem of
over-parametrization in statistical modelling. We shall not pursue those
matters any further; their conceptual and methodological unification remains
to be accomplished.
3.5.1 Example: Dirichlet priors

Consider the continuous Dirichlet prior $g \sim \mathcal{D}(\boldsymbol\alpha)$, with density
$\rho(g|\boldsymbol\alpha) = \frac{\Gamma(\alpha)}{\prod_j \Gamma(\alpha_j)} \prod_j g_j^{\alpha_j - 1}$, normalised to unity in $S_m$, where
$\boldsymbol\alpha = (\alpha_1, \ldots, \alpha_m)$ is a vector of parameters with $\alpha_j > 0$ and
$\alpha := \sum_j \alpha_j$. Setting $\pi_j := \alpha_j/\alpha = E(g_j|\boldsymbol\alpha)$, Stirling approximation yields
$\rho(g|\boldsymbol\alpha) \cong \exp(-\alpha K(\pi\|g))$ for $\alpha$ large.

After observing the data $\mathbf{n} = (n_1, \ldots, n_m)$, the posterior distribution is
well-known to be $\mathcal{D}(\boldsymbol\alpha + \mathbf{n})$. Using $f_j^D = n_j/n$, one gets
$\rho(g|\boldsymbol\alpha + \mathbf{n}) / \rho(g|\boldsymbol\alpha) \cong \exp(-n K(f^D\|g))$ for $n$ large, as it must from (15). Hence

    $\rho(g|\boldsymbol\alpha + \mathbf{n}) \cong \exp[-\alpha K(\pi\|g) - n K(f^D\|g)] \cong \exp[-(\alpha + n) K(f^{post}\|g)]$    (16)

    where $f_j^{post} = E(g_j|\boldsymbol\alpha + \mathbf{n}) = \lambda \pi_j + (1 - \lambda) f_j^D$ with $\lambda := \frac{\alpha}{\alpha + n}$.    (17)

(16) and (17) show the parameter $\alpha$ to measure the strength of belief in the
prior guess, measured in units of the sample size (Ferguson 1974).
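The shrinkage form (17) of the posterior mean is easy to confirm numerically. In this sketch the Dirichlet parameters and the counts are made-up values, and `posterior_mean` is a hypothetical helper name.

```python
def posterior_mean(alpha, counts):
    """Posterior mean of a Dirichlet(alpha + counts), computed two ways:
    as the convex combination (17) and directly as (alpha_j + n_j) / (alpha + n)."""
    a, n = sum(alpha), sum(counts)
    lam = a / (a + n)                          # weight of the prior guess
    prior_mean = [aj / a for aj in alpha]      # pi_j
    empirical = [nj / n for nj in counts]      # f_j^D
    shrunk = [lam * p + (1 - lam) * f for p, f in zip(prior_mean, empirical)]
    direct = [(aj + nj) / (a + n) for aj, nj in zip(alpha, counts)]
    return shrunk, direct, lam

alpha = [2.0, 2.0, 1.0]   # hypothetical prior parameters, alpha = 5
counts = [12, 6, 2]       # hypothetical data, n = 20
shrunk, direct, lam = posterior_mean(alpha, counts)
```

With a total prior weight of 5 against a sample of 20, the prior guess enters with weight `lam = 0.2`, which reads as "the prior is worth five observations".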

4 Maximum entropy 

4.1 Large deviations: Sanov's theorem 

Suppose data to be incompletely observed, i.e. one only knows that $f^D \in \mathcal{D}$,
where $\mathcal{D} \subset S$ is a subset of the simplex $S$, the set of all possible distributions
with $m$ modalities. Then, for an i.i.d. process, a theorem due to Sanov (1957)
says that, for sufficiently regular $\mathcal{D}$, the probability that $f^D \in \mathcal{D}$ under
model $f^M$ decreases exponentially with $n$ as

    $P(f^D \in \mathcal{D} \mid f^M) \cong \exp(-n K(\tilde f^{\mathcal D}\|f^M))$ where $\tilde f^{\mathcal D} := \arg\min_{f \in \mathcal{D}} K(f\|f^M)$.    (18)

$\tilde f^{\mathcal D}$ is the so-called maximum entropy (ME) solution, that is the most
probable empirical distribution under the prior model $f^M$ and the knowledge that
$f^D \in \mathcal{D}$. Of course, $\tilde f^{\mathcal D} = f^M$ if $f^M \in \mathcal{D}$.

4.2 On the nature of the maximum entropy solution 

When the prior is uniform ($f_j^M = 1/m$), then

    $K(f^D\|f^M) = \ln m - H(f^D)$

and minimising (over $f \in \mathcal{D}$) the relative entropy $K(f\|f^M)$ amounts to
maximising the entropy $H(f)$ (over $f \in \mathcal{D}$).
For decades (ca. 1950-1990), the "maximum entropy" principle, also
called "minimum discrimination information (MDI) principle" by Kullback
(1959), has largely been used in science and engineering as a first-principle,
"maximally non-informative" method of generating models, maximising our
ignorance (as represented by the entropy) under our available knowledge
($f \in \mathcal{D}$) (see in particular Jaynes (1957), (1978)).

However, (18) shows the maximum entropy construction to be justified
from Sanov's theorem, and to result from the minimisation of the first
argument of the relative entropy, which points towards the empirical (rather than
theoretical) nature of the latter. In the present setting, $\tilde f^{\mathcal D}$ appears as the
most likely data reconstruction under the prior model and the incomplete
observations (see also section 5.3).

4.2.1 Example: unobserved category 

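Let $f^M$ be given and suppose one knows that a category, say $j = 1$, has not
occurred. Then

    $\tilde f_j^{\mathcal D} = 0$ for $j = 1$ and $\tilde f_j^{\mathcal D} = \frac{f_j^M}{1 - f_1^M}$ for $j > 1$, so that $K(\tilde f^{\mathcal D}\|f^M) = -\ln(1 - f_1^M)$,

whose finiteness (for $f_1^M < 1$) contrasts with the behavior $K(f^M\|\tilde f^{\mathcal D}) = \infty$ (for
$f_1^M > 0$). See example 2.5.1 f).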

4.2.2 Example: coarse grained observations 

Let $f^M$ be a given distribution with categories $j = 1, \ldots, m$. Let
$J = 1, \ldots, M < m$ denote groups of categories, and suppose that observations
are aggregated or coarse-grained, i.e. of the form

    $\mathcal{D} = \{ f^D \mid \sum_{j \in J} f_j^D = F_J^D, \; J = 1, \ldots, M \}$.

Let $J(j)$ denote the group to which $j$ belongs. The ME distribution then
reads (see (8) and example 3.2.1)

    $\tilde f_j^{\mathcal D} = f_j^M \, \frac{F^D_{J(j)}}{F^M_{J(j)}}$ where $F_J^M := \sum_{j \in J} f_j^M$ and $K(\tilde f^{\mathcal D}\|f^M) = K(F^D\|F^M)$.    (19)
4.2.3 Example: symmetrical observations

Let $f^M_{jk}$ be a given joint model for square distributions ($j, k = 1, \ldots, m$).
Suppose one knows the data distribution to be symmetrical, i.e.

    $\mathcal{D} = \{ f \mid f_{jk} = f_{kj} \}$.

Then

    $\tilde f^{\mathcal D}_{jk} = \frac{\sqrt{f^M_{jk} f^M_{kj}}}{Z}$ where $Z := \sum_{jk} \sqrt{f^M_{jk} f^M_{kj}}$

which is contrasted with the result $\hat f^S_{jk} = \frac12 (f^D_{jk} + f^D_{kj})$ of example 3.4.1 (see
section 5.1).

4.3 "Standard" maximum entropy: linear constraint 

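Let $\mathcal{D}$ be determined by a linear constraint of the form

    $\mathcal{D} = \{ f \mid \sum_{j=1}^m f_j a_j = \bar a \}$ with $\min_j a_j \le \bar a \le \max_j a_j$

that is, one knows the empirical average of some quantity $\{a_j\}_{j=1}^m$ to be
fixed to $\bar a$. Minimizing over $f \in S$ the functional

    $K(f\|f^M) - \theta A(f)$    $A(f) := \sum_{j=1}^m f_j a_j$    (20)

yields

    $\tilde f_j^{\mathcal D} = \frac{f_j^M \exp(\theta a_j)}{Z(\theta)}$    $Z(\theta) := \sum_k f_k^M \exp(\theta a_k)$    (21)

where the Lagrange multiplier $\theta$ is determined by the constraint
$\bar a(\theta) := \sum_j \tilde f_j^{\mathcal D}(\theta) \, a_j = \bar a$ (see figure 2). The sign of the multiplier in (20)
is chosen so as to agree with (21), where $\theta > 0$ raises the average above its
prior value.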

4.3.1 Example: average value of a dice 



Suppose one believes a dice to be fair ($f_j^M = 1/6$), and one is told that
the empirical average of its face values is say $\bar a = \sum_j f_j^D \, j = 4$, instead of
$\bar a = 3.5$ as expected. The value of $\theta$ in (21) ensuring $\sum_j \tilde f_j^{\mathcal D} \, j = 4$ turns out
to be $\theta = 0.175$, with $\tilde f_1^{\mathcal D} = 0.10$, $\tilde f_2^{\mathcal D} = 0.12$, $\tilde f_3^{\mathcal D} = 0.15$,
$\tilde f_4^{\mathcal D} = 0.17$, $\tilde f_5^{\mathcal D} = 0.21$, $\tilde f_6^{\mathcal D} = 0.25$ (Cover and Thomas (1991) p.
295).
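The multiplier $\theta$ of (21) has no closed form here, but since the average $\bar a(\theta)$ increases with $\theta$ it can be found by bisection. The following sketch assumes the exponential tilting of (21) with a positive sign; the bracketing interval and tolerance are arbitrary choices.

```python
import math

def tilted(prior, values, theta):
    """Exponentially tilted distribution f_j proportional to prior_j * exp(theta * a_j)."""
    weights = [p * math.exp(theta * a) for p, a in zip(prior, values)]
    z = sum(weights)
    return [w / z for w in weights]

def mean(dist, values):
    return sum(p * a for p, a in zip(dist, values))

def solve_theta(prior, values, target, lo=-20.0, hi=20.0, tol=1e-12):
    """Bisection on theta: the tilted mean is increasing in theta."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean(tilted(prior, values, mid), values) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

faces = [1, 2, 3, 4, 5, 6]
fair = [1 / 6] * 6
theta = solve_theta(fair, faces, target=4.0)
f_me = tilted(fair, faces, theta)   # maximum entropy distribution with mean 4
```

The solver returns `theta` close to 0.175, and the resulting probabilities sum to one and increase with the face value.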
Figure 2: typical behaviour of $\bar a(\theta)$



4.3.2 Example: Statistical Mechanics 

An interacting particle system can occupy $m \gg 1$ configurations
$j = 1, \ldots, m$, a priori equiprobable ($f_j^M = 1/m$), with corresponding energy
$E_j$. Knowing the average energy to be $\bar E$, the resulting ME solution (with
$\beta := -\theta$) is the Boltzmann-Gibbs distribution

    $\tilde f_j^{\mathcal D} = \frac{\exp(-\beta E_j)}{Z(\beta)}$    $Z(\beta) := \sum_{k=1}^m \exp(-\beta E_k)$    (22)

minimising the free energy $F(f) := E(f) - T H(f)$, obtained (up to a
constant term) by multiplying the functional (20) by the temperature
$T := 1/\beta = -1/\theta$. Temperature plays the role of an arbiter determining the
trade-off between the contradictory objectives of energy minimisation and
entropy maximisation:

• at high temperatures $T \to \infty$ (i.e. $\beta \to 0^+$), the Boltzmann-Gibbs
distribution $\tilde f^{\mathcal D}$ becomes uniform and the entropy $H(\tilde f^{\mathcal D})$ maximum
(fluid-like organisation of the matter).

• at low temperatures $T \to 0^+$ (i.e. $\beta \to \infty$), the Boltzmann-Gibbs
distribution $\tilde f^{\mathcal D}$ becomes concentrated on the ground states
$j_- := \arg\min_j E_j$, making the average energy $E(\tilde f^{\mathcal D})$ minimum (crystal-like
organisation of the matter).

Example 3.4.1 continued: quasi-symmetry

ME approach to gravity modelling consists in considering flows constrained
by $q$ linear constraints of the form

    $\mathcal{D} = \{ f \mid \sum_{j,k=1}^m f_{jk} a^\alpha_{jk} = \bar a^\alpha, \; \alpha = 1, \ldots, q \}$

such that, typically

1) $a_{jk} := d_{jk} = d_{kj}$ (fixed average trip distance, cost or time $d_{jk}$)

2) $a^\alpha_{jk} := \delta_{j\alpha}$ (fixed origin profiles, $\alpha = 1, \ldots, m$)

3) $a^\alpha_{jk} := \delta_{\alpha k}$ (fixed destination profiles, $\alpha = 1, \ldots, m$)

4) $a_{jk} := \delta_{jk}$ (fixed proportion of stayers)

5) $a^\alpha_{jk} := \delta_{j\alpha} - \delta_{\alpha k}$ (balanced flows, $\alpha = 1, \ldots, m$)

Constraints 1) to 5) (and linear combinations of them) yield all the "classical
Gravity models" proposed in Geography, such as the exponential decay model
(with $f^M_{jk} = a_j b_k$):

    $\tilde f^{\mathcal D}_{jk} = \alpha_j \beta_k \exp(-\beta d_{jk})$

Moreover, if the prior $f^M$ is quasi-symmetric, so is $\tilde f^{\mathcal D}$ under the above
constraints (Bavaud 2002a).



5 Additive decompositions 

5.1 Convex and exponential families of distributions 
Definition: a family $\mathcal{F} \subset S$ of distributions is a convex family iff

    $f, g \in \mathcal{F} \Rightarrow \lambda f + (1 - \lambda) g \in \mathcal{F}$    $\forall \lambda \in [0, 1]$

Observations typically involve the identification of merged categories, and
the corresponding empirical distributions are coarse grained, that is
determined through aggregated values $F_J := \sum_{j \in J} f_j$ only. Such coarse grained
distributions form a convex family (see table 1). More generally, linearly
constrained distributions (section 4.3) are convex. Distributions (11)
belonging to the optimal Neyman-Pearson regions $W$ (or $W^c$), posterior
distributions (17) as well as marginally homogeneous distributions (example
3.4.1) provide other examples of convex families.

Definition: a family $\mathcal{F} \subset S$ of distributions is an exponential family iff

    $f, g \in \mathcal{F} \Rightarrow \frac{f^\mu g^{1-\mu}}{Z(\mu)} \in \mathcal{F}$ where $Z(\mu) := \sum_j f_j^\mu g_j^{1-\mu}$    $\forall \mu \in [0, 1]$
    Family $\mathcal{F}$        characterization                                   remark                  convex   expon.
    deficient            $f_1 = 0$                                                                  yes      yes
    deterministic        $f_1 = 1$                                                                  yes      yes
    coarse grained       $\sum_{j \in J} f_j = F_J$                                                 yes      no
    mixture              $f_j = f_{(Jq)} = \rho_q h^q_J$                     $\{h^q_J\}$ fixed        yes      yes
    mixture              $f_j = f_{(Jq)} = \rho_q h^q_J$                     $\{h^q_J\}$ adjustable   no       yes
    independent          $f_{jk} = a_j b_k$                                                         no       yes
    marginally homog.    $f_{j\bullet} = f_{\bullet j}$                      square tables           yes      no
    symmetric            $f_{jk} = f_{kj}$                                   square tables           yes      yes
    quasi-symmetric      $f_{jk} = a_j b_k c_{jk}$, $c_{jk} = c_{kj}$        square tables           no       yes

Table 1: some convex and/or exponential families



Exponential families are a favorite object of classical statistics. Most clas- 
sical discrete or continuous probabilistic models (log-linear, multinomial, 
Poisson, Dirichlet, Normal, Gamma, etc.) constitute exponential families. 
Amari (1985) has developed a local parametric characterisation of exponen- 
tial and convex families in a differential geometric framework. 

5.2 Factor analyses 

Independence models are exponential but not convex (see table 1): the
weighted sum of independent distributions is not independent in general.
Conversely, non-independent distributions can be decomposed as a sum of
(latent) independent terms through factor analysis. The spectral
decomposition of the chi-square producing the factorial correspondence analysis
of contingency tables turns out to be exactly applicable on mutual information
(9) as well, yielding an "entropic" alternative to (categorical) factor analysis
(Bavaud 2002b).

Independent component analysis (ICA) aims at determining the linear 
transformation of multivariate (continuous) data making them as indepen- 
dent as possible. In contrast to principal component analysis, limited to the 
second-order statistics associated to gaussian models, ICA attempts to take 
into account higher-order dependencies occurring in the mutual information 
between variables, and extensively relies on information-theoretic principles, 
as developed in Lee et al. (2000) or Cardoso (2003) and references therein. 
5.3 Pythagorean theorems

The following results, sometimes referred to as the Pythagorean theorems of
IT, provide an exact additive decomposition of the relative entropy:

Decomposition theorem for convex families: if $\mathcal{D}$ is a convex family,
then

    $K(f\|f^M) = K(f\|\tilde f^{\mathcal D}) + K(\tilde f^{\mathcal D}\|f^M)$ for any $f \in \mathcal{D}$    (23)

where $\tilde f^{\mathcal D}$ is the ME distribution for $\mathcal{D}$ with prior $f^M$.

Decomposition theorem for exponential families: if $\mathcal{M}$ is an
exponential family, then

    $K(f^D\|g) = K(f^D\|\hat f^M) + K(\hat f^M\|g)$ for any $g \in \mathcal{M}$    (24)

where $\hat f^M$ is the ML distribution for $\mathcal{M}$ with data $f^D$.

Sketch of the proof of (23) (see e.g. Simon 1973): if $\mathcal{D}$ is convex
with $\dim(\mathcal{D}) = \dim(S) - q$, its elements are of the form
$\mathcal{D} = \{ f \mid \sum_j f_j a_j^\alpha = \bar a^\alpha \text{ for } \alpha = 1, \ldots, q \}$, which implies the maximum
entropy solution to be of the form $\tilde f_j^{\mathcal D} = \exp(\sum_\alpha \lambda_\alpha a_j^\alpha) f_j^M / Z(\lambda)$.
Substituting this expression and using $\sum_j f_j a_j^\alpha = \sum_j \tilde f_j^{\mathcal D} a_j^\alpha$ proves (23).

Sketch of the proof of (24) (see e.g. Simon 1973): if $\mathcal{M}$ is exponential
with $\dim(\mathcal{M}) = r$, its elements are of the form
$f_j = \rho_j \exp(\sum_{\alpha=1}^r \lambda_\alpha a_j^\alpha) / Z(\lambda)$ (where the partition function $Z(\lambda)$ ensures
the normalisation), containing $r$ free non-redundant parameters $\lambda \in \mathbb{R}^r$.
Substituting this expression and using the optimality condition
$\sum_j \hat f_j^M a_j^\alpha = \sum_j f_j^D a_j^\alpha$ for all $\alpha = 1, \ldots, r$ proves (24).

Equations (23) and (24) show that $\tilde f^{\mathcal D}$ and $\hat f^M$ can both occur as left and
right arguments of the relative entropy, underlining their somehow hybrid
nature, intermediate between data and models (see section 4.2).
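The exponential-family decomposition can be checked numerically on the independence family, whose ML estimate is the product of the empirical marginals. The joint table and the competing independent model `g` below are arbitrary illustrative choices.

```python
import math

def relative_entropy(f, g):
    """K(f||g) for two joint tables given as lists of rows."""
    return sum(
        p * math.log(p / q)
        for row_f, row_g in zip(f, g)
        for p, q in zip(row_f, row_g) if p > 0
    )

def product(a, b):
    """Independent joint distribution with marginals a and b."""
    return [[x * y for y in b] for x in a]

f_data = [[0.375, 0.125],
          [0.125, 0.375]]                     # hypothetical f^D with dependence
rows = [sum(r) for r in f_data]               # empirical marginal of X
cols = [sum(c) for c in zip(*f_data)]         # empirical marginal of Y
f_ml = product(rows, cols)                    # ML estimate within the independence family

g = product([0.2, 0.8], [0.6, 0.4])           # any other member of the family

lhs = relative_entropy(f_data, g)
rhs = relative_entropy(f_data, f_ml) + relative_entropy(f_ml, g)
```

The two sides agree to machine precision, whereas replacing `g` by a non-independent table would in general break the equality.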

5.3.1 Example: nested tests 

Consider two exponential families $\mathcal{M}$ and $\mathcal{N}$ with $\mathcal{M} \subset \mathcal{N}$. Twofold
application of (24) demonstrates the identity

    $K(f^D\|\hat f^M) - K(f^D\|\hat f^N) = K(\hat f^N\|\hat f^M)$

occurring in nested tests such as (14).
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5.3.2 Example: conditional independence in three-dimensional 
tables 

Let f® k := riijk/n with n := n,„ be the empirical distribution associated 
to the riijk = " number of individuals in the category i of X , j ofY and k 
of Z " . Consider the families of models 

£ = {/ G S | fijk = ciijbk} = {/ G S | In f ijk = A + ay + 
A* = {/ € S | /. ifc = cidfc} = {/ G S | In /. ifc = /x + 7j + 
A/ = {/ G 5 | /yfe = e^/ijfc} = {/ G 5 | ln/yit = v + + r^} . 

Model $\mathcal L$ expresses that $Z$ is independent from $X$ and $Y$ (denoted $Z \perp (X, Y)$). Model $\mathcal M$ expresses that $Z$ and $Y$ are independent ($Y \perp Z$). Model $\mathcal N$ expresses that, conditionally on $Y$, $X$ and $Z$ are independent ($X \perp Z \mid Y$). Models $\mathcal L$ and $\mathcal N$ are exponential (in $S$), and $\mathcal M$ is exponential in the space of joint distributions on $(Y, Z)$. They constitute well-known examples of log-linear models (see e.g. Christensen (1990)).

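Maximum likelihood estimates and associated relative entropies obtain as (see example 3.2.2)

$$\tilde f^{\mathcal L}_{ijk} = f^D_{ij\bullet}\, f^D_{\bullet\bullet k} \;\Rightarrow\; K(f^D\|\tilde f^{\mathcal L}) = H_D(XY) + H_D(Z) - H_D(XYZ)$$

$$\tilde f^{\mathcal M}_{\bullet jk} = f^D_{\bullet j\bullet}\, f^D_{\bullet\bullet k} \;\Rightarrow\; K(f^D\|\tilde f^{\mathcal M}) = H_D(Y) + H_D(Z) - H_D(YZ)$$

$$\tilde f^{\mathcal N}_{ijk} = \frac{f^D_{ij\bullet}\, f^D_{\bullet jk}}{f^D_{\bullet j\bullet}} \;\Rightarrow\; K(f^D\|\tilde f^{\mathcal N}) = H_D(XY) + H_D(YZ) - H_D(XYZ) - H_D(Y)$$

and permit to test the corresponding models as in (7). As a matter of fact, the present example illustrates another aspect of exact decomposition, namely $\mathcal L = \mathcal M \cap \mathcal N$:

$$f^D_{ijk}\, \tilde f^{\mathcal L}_{ijk} = \tilde f^{\mathcal M}_{ijk}\, \tilde f^{\mathcal N}_{ijk} \qquad K(f^D\|\tilde f^{\mathcal L}) = K(f^D\|\tilde f^{\mathcal M}) + K(f^D\|\tilde f^{\mathcal N}) \qquad df^{\mathcal L} = df^{\mathcal M} + df^{\mathcal N}$$

where $df$ denotes the appropriate degrees of freedom for the chi-square test (7).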
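The exact decomposition can be checked numerically. The following is a minimal sketch on a hypothetical 2 x 3 x 2 table of random counts; the helper names (`marginal`, `entropy`, `relative_entropy`) are illustrative, and the extension of the "Y independent of Z" estimate to the full table (keeping the empirical conditional distribution of X given (Y, Z)) is an assumption of this sketch.

```python
import math
import random
from itertools import product

def marginal(f, axes):
    """Sum a joint distribution {(i, j, k): prob} down to the given axes."""
    out = {}
    for cell, p in f.items():
        key = tuple(cell[a] for a in axes)
        out[key] = out.get(key, 0.0) + p
    return out

def entropy(f):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in f.values() if p > 0)

def relative_entropy(f, g):
    """K(f||g) = sum f ln(f/g), with the convention 0 ln 0 = 0."""
    return sum(p * math.log(p / g[cell]) for cell, p in f.items() if p > 0)

# Hypothetical table of strictly positive counts n_ijk.
rng = random.Random(0)
cells = list(product(range(2), range(3), range(2)))
counts = {cell: rng.randint(1, 20) for cell in cells}
n = sum(counts.values())
f_d = {cell: counts[cell] / n for cell in cells}

f_xy, f_yz = marginal(f_d, (0, 1)), marginal(f_d, (1, 2))
f_y, f_z = marginal(f_d, (1,)), marginal(f_d, (2,))

# ML estimates for the three models: Z indep. of (X, Y); Y indep. of Z;
# X indep. of Z given Y.
f_l = {(i, j, k): f_xy[(i, j)] * f_z[(k,)] for i, j, k in cells}
f_m = {(i, j, k): f_d[(i, j, k)] * f_y[(j,)] * f_z[(k,)] / f_yz[(j, k)]
       for i, j, k in cells}
f_n = {(i, j, k): f_xy[(i, j)] * f_yz[(j, k)] / f_y[(j,)] for i, j, k in cells}

k_l = relative_entropy(f_d, f_l)
k_m = relative_entropy(f_d, f_m)
k_n = relative_entropy(f_d, f_n)
print(k_l, k_m + k_n)   # the two values agree up to rounding
```

Multiplying each relative entropy by 2n gives the statistic that would be compared with the chi-square quantile for the corresponding degrees of freedom.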

5.4 Alternating minimisation and the EM algorithm 
5.4.1 Alternating minimisation 

Maximum likelihood and maximum entropy are particular cases of the general problem

$$\min_{f \in \mathcal F} \min_{g \in \mathcal G} K(f\|g) . \qquad (25)$$

Alternating minimisation consists in defining recursively

$$f^{(n)} := \arg\min_{f \in \mathcal F} K(f\|g^{(n)}) \qquad (26)$$
$$g^{(n+1)} := \arg\min_{g \in \mathcal G} K(f^{(n)}\|g) . \qquad (27)$$

Starting with some $g^{(0)} \in \mathcal G$ (or some $f^{(0)} \in \mathcal F$), and for $\mathcal F$ and $\mathcal G$ convex, $K(f^{(n)}\|g^{(n)})$ converges towards (25) (Csiszár (1975); Csiszár and Tusnády, 1984).

5.4.2 The EM algorithm 

Problem (26) is easy to solve when $\mathcal F$ is the coarse grained family $\{f \mid \sum_{j \in J} f_j = F_J\}$, with solution $f_j^{(n)} = g_j^{(n)} F_J / G_J^{(n)}$ and the result $K(f^{(n)}\|g^{(n)}) = K(F\|G^{(n)})$ (see example …).

The present situation describes incompletely observed data, in which $F$ only (and not $f$) is known, with corresponding model $G(g)$ in $\mathcal M := \{G \mid G_J = \sum_{j \in J} g_j \text{ and } g \in \mathcal G\}$. Also

$$\min_{G \in \mathcal M} K(F\|G) = \min_{g \in \mathcal G} K(F\|G(g)) = \min_{g \in \mathcal G} \min_{f \in \mathcal F} K(f\|g) = \lim_{n \to \infty} K(f^{(n)}\|g^{(n)}) = \lim_{n \to \infty} K(F\|G^{(n)})$$

which shows $G^{(\infty)}$ to be the solution of $\min_{G \in \mathcal M} K(F\|G)$. This particular version of the alternating minimisation procedure is known as the EM algorithm in the literature (Dempster et al. 1977), where (26) is referred to as the "expectation step" and (27) as the "maximisation step".

Of course, the above procedure is fully operational provided (27) can also be easily solved. This occurs for instance for finite-mixture models determined by $c$ fixed distributions $h_J^q$ (with $\sum_J h_J^q = 1$ for $q = 1, \ldots, c$), such that the categories $j = 1, \ldots, m$ read as product categories of the form $j = (J, q)$ with

$$g_j = g_{Jq} = p_q h_J^q \qquad p_q \ge 0 \qquad \sum_{q=1}^c p_q = 1 \qquad G_J = \sum_q p_q h_J^q$$

where the "mixing proportions" $p_q$ are freely adjustable. Solving (27) yields

$$p_q^{(n+1)} = \sum_J f_{Jq}^{(n)} = p_q^{(n)} \sum_J \frac{h_J^q F_J}{\sum_r h_J^r\, p_r^{(n)}}$$

which converges towards the optimal mixing proportions $p_q$, unique since $\mathcal G$ is convex. Continuous versions of the algorithm (in which $J$ represents a position in a Euclidean space) generate the so-called soft clustering algorithms, which can be further restricted to the hard clustering and K-means algorithms. However, the distributions $h_J^q$ used in the latter cases generally contain additional adjustable parameters (typically the mean and the covariance matrix of normal distributions), which break down the convexity of $\mathcal G$ and cause the algorithm to converge towards local minima.
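The update of the mixing proportions can be sketched in a few lines. This is a minimal illustration, assuming the component distributions are fixed and given as lists; the function name `em_mixing_proportions` and the toy numbers are hypothetical, and the expectation and maximisation steps are merged into a single multiplicative update.

```python
def em_mixing_proportions(F, h, n_iter=200):
    """Estimate mixing proportions p_q from a coarse-grained distribution F.

    F[J]    : observed distribution over the coarse categories J
    h[q][J] : the c fixed component distributions (each row sums to 1)
    """
    c, m = len(h), len(F)
    p = [1.0 / c] * c                      # uniform starting point
    for _ in range(n_iter):
        # Model distribution G_J = sum_q p_q h_J^q for the current p
        G = [sum(p[q] * h[q][J] for q in range(c)) for J in range(m)]
        # Combined E and M steps: p_q <- p_q * sum_J h_J^q F_J / G_J
        p = [p[q] * sum(h[q][J] * F[J] / G[J] for J in range(m) if G[J] > 0)
             for q in range(c)]
    return p

# Toy check: F is generated exactly by known proportions, which EM recovers.
h = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
true_p = [0.3, 0.7]
F = [sum(true_p[q] * h[q][J] for q in range(2)) for J in range(3)]
p_hat = em_mixing_proportions(F, h)
print(p_hat)
```

Each update keeps the proportions normalised, since the new values sum to the total mass of F.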



6 Beyond independence: Markov chain models and 
texts 

As already proposed by Shannon (1948), the independence formalism can be extended to stationary dependent sequences, that is to categorical time series or "textual" data $D = x_1 x_2 \cdots x_n$, such as

D=bbaabbaabbbaabbbaabbbaabbaabaabbaabbaabbaabbaabbaabbaabb 
aabaabbaabbbaabaabaabbbaabbbaabbaabbaabbaabaabbbaabbbaabbaa 
baabaabbaabaabbaabbaabbbaabbaabaabaabbaabbbbaabbaabaabaabaa 
baabaabaabbaabbaabbaabbbbaab . 

In this context, each occurrence $x_i$ constitutes a letter taking values $\omega_j$ in a state space $\Omega$, the alphabet, of cardinality $m = |\Omega|$. A sequence of $r$ letters $\alpha := \omega_1 \ldots \omega_r \in \Omega^r$ is an $r$-gram. In our example, $n = 202$, $\Omega = \{a, b\}$, $m = 2$, $\Omega^2 = \{aa, ab, ba, bb\}$, etc.



6.1 Markov chain models 

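A Markov chain model of order $r$ is specified by the conditional probabilities

$$f^M(\omega|\alpha) \ge 0 \qquad \omega \in \Omega \qquad \alpha \in \Omega^r \qquad \sum_{\omega \in \Omega} f^M(\omega|\alpha) = 1 .$$

$f^M(\omega|\alpha)$ is the probability that the symbol following the $r$-gram $\alpha$ is $\omega$. It obtains from the stationary distributions $f^M(\alpha\omega)$ and $f^M(\alpha)$ as

$$f^M(\omega|\alpha) = \frac{f^M(\alpha\omega)}{f^M(\alpha)} .$$

The set $\mathcal M_r$ of models of order $r$ constitutes an exponential family, nested as $\mathcal M_r \subset \mathcal M_{r+1}$ for all $r \ge 0$. In particular, $\mathcal M_0$ denotes the independence models, and $\mathcal M_1$ the ordinary (first-order) Markov chains.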
The corresponding empirical distributions $f^D(\alpha)$ give the relative proportion of $r$-grams $\alpha \in \Omega^r$ in the text $D$. They obtain as

$$f^D(\alpha) := \frac{n(\alpha)}{n - r + 1} \qquad \text{with} \qquad \sum_{\alpha \in \Omega^r} f^D(\alpha) = 1$$

where $n(\alpha)$ counts the number of occurrences of $\alpha$ in $D$. In the above example, the tetragram counts are for instance:

$\alpha$   $n(\alpha)$     $\alpha$   $n(\alpha)$     $\alpha$   $n(\alpha)$
aaaa         0          aaab         0          aaba        16
aabb        35          abaa        16          abab         0
abba        22          abbb        11          baaa         0
baab        51          baba         0          babb         0
bbaa        35          bbab         0          bbba        11
bbbb         2                                  total      199
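Counting r-grams with a sliding window is straightforward. The sketch below uses a hypothetical short text rather than the sequence D above, and the function names are illustrative.

```python
from collections import Counter

def rgram_counts(text, r):
    """n(alpha): number of occurrences of each r-gram alpha in the text."""
    return Counter(text[i:i + r] for i in range(len(text) - r + 1))

def empirical_distribution(text, r):
    """f^D(alpha) = n(alpha) / (n - r + 1)."""
    total = len(text) - r + 1
    return {gram: c / total for gram, c in rgram_counts(text, r).items()}

sample = "bbaabbaabbb"      # hypothetical short text, n = 11
counts = rgram_counts(sample, 2)
f_d = empirical_distribution(sample, 2)
print(dict(counts))
```

A text of length n contains n - r + 1 overlapping r-grams, which is why the counts of the tetragram table sum to 199 for n = 202.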



6.2 Simulating a sequence 

Under the assumption that a text follows an $r$-order model $\mathcal M_r$, empirical distributions $f^D(\alpha)$ (with $\alpha \in \Omega^{r+1}$) converge for $n$ large to $f^M(\alpha)$. The latter define in turn $r$-order transition probabilities, allowing the generation of new texts, started from the stationary distribution.
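The generation step can be sketched as follows. This is an illustration rather than the procedure actually used for the examples: the function names are hypothetical, the initial r-gram is drawn from the empirical r-gram distribution of the source text, and the order-0 fallback at a dead end is a choice of this sketch.

```python
import random
from collections import Counter, defaultdict

def fit_transitions(text, r):
    """Counts n(alpha omega), grouped by the context alpha of length r."""
    trans = defaultdict(Counter)
    for i in range(len(text) - r):
        trans[text[i:i + r]][text[i + r]] += 1
    return trans

def simulate(text, r, length, rng):
    """Generate `length` symbols from the empirical order-r transitions."""
    trans = fit_transitions(text, r)
    start = rng.randrange(len(text) - r + 1)   # initial r-gram drawn from the text
    out = list(text[start:start + r])
    while len(out) < length:
        context = "".join(out[len(out) - r:])
        followers = trans.get(context)
        if followers:
            symbols, weights = zip(*followers.items())
            out.append(rng.choices(symbols, weights=weights)[0])
        else:
            # Dead end (context seen only at the very end of the text):
            # fall back to an order-0 draw. This is a choice of this sketch.
            out.append(rng.choice(text))
    return "".join(out[:length])

source = "ab" * 20                      # hypothetical toy text
rng = random.Random(42)
generated = simulate(source, 1, 50, rng)
print(generated)
```

With r = 0 the context is the empty string and the generator reduces to independent draws from the letter frequencies.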



6.2.1 Example 

The following sequences are generated from the empirical transition probabilities of the Universal Declaration of Human Rights, of length $n = 8'149$ with $m = 27$ states (the alphabet + the blank, without punctuation):

r = 0 (independent process)

iahthire edr pynuecu d lae mrfa ssooueoilhnid nritshf ssmo 
nise yye noa it eosc e lrc jdnca tyopaooieoegasrors c hel 
niooaahettnoos rnei s sosgnolaotd t atiet 

r = 1 (first-order Markov chain) 

erionjuminek in 1 ar hat arequbjus st d ase scin ero tubied 
pmed beetl equly shitoomandorio tathic wimof tal ats evash 
indimspre tel sone aw onere pene e ed uaconcol mo atimered 
r = 2 (second-order Markov chain)

mingthe rint son of the frentery and com andepent the halons 
hal to coupon efornitity the rit noratinsubject will the the 
in priente hareeducaresull ch infor aself and evell 

r = 3 (third-order Markov chain) 

law socience of social as the right or everyone held 
genuinely available sament of his no one may be enties the 
right in the cons as the right to equal co one soveryone 

r = 4 (fourth-order Markov chain) 

are endowed with other means of full equality and to law no one 
is the right to choose of the detent to arbitrarily in science 
with pay for through freely choice work 

r = 9 (ninth-order Markov chain) 

democratic society and is entitled without interference 

and to seek receive and impartial tribunals for acts violating 

the fundamental rights indispensable for his 

Of course, empirical distributions are expected to accurately estimate model distributions for $n$ large enough, or equivalently for $r$ small enough, typically for

$$r \le r_{\max} := \frac{1}{2}\, \frac{\ln n}{\ln m} .$$

Simulations with $r$ above about $r_{\max}$ (here roughly equal to 2) are over-parameterized: the number of parameters to be estimated exceeds the sample's ability to do so, and simulations replicate fragments of the initial text rather than typical $r$-gram occurrences of written English in general, providing a vivid illustration of the curse of dimensionality phenomenon.

6.3 Entropies and entropy rate 

The $r$-gram entropy and the conditional entropy of order $r$ associated to a (model or empirical) distribution $f$ are defined by

$$H_r(f) := -\sum_{\alpha \in \Omega^r} f(\alpha) \ln f(\alpha) = H(X_1, \ldots, X_r)$$

$$h_{r+1}(f) := -\sum_{\alpha \in \Omega^r} f(\alpha) \sum_{\omega \in \Omega} f(\omega|\alpha) \ln f(\omega|\alpha) = H_{r+1}(f) - H_r(f) = H(X_{r+1}|X_1, \ldots, X_r)$$

The quantity $h_r(f)$ is non-increasing in $r$. Its limit defines the entropy rate, measuring the conditional uncertainty on the next symbol knowing the totality of past occurrences:

$$h(f) := \lim_{r \to \infty} h_r(f) = \lim_{r \to \infty} \frac{H_r(f)}{r} \qquad \text{entropy rate.}$$

By construction, $0 \le h(f) \le \ln m$, and the so-called redundancy $R := 1 - (h/\ln m)$ satisfies $0 \le R \le 1$.

The entropy rate measures the randomness of the stationary process: $h(f) = \ln m$ (i.e. $R = 0$) characterizes a maximally random process, that is a dice model with uniform distribution. The process is ultimately deterministic iff $h(f) = 0$ (i.e. $R = 1$).

Shannon's estimate of the entropy rate of written English on $m = 27$ symbols is about $h = 1.3$ bits per letter, that is $h = 1.3 \times \ln 2 = 0.90$ nat, corresponding to $R = 0.73$: a hundred pages of written English are in theory compressible without loss to $100 - 73 = 27$ pages. Equivalently, using an alphabet containing $\exp(0.90) = 2.46$ symbols only (and the same number of pages) is in principle sufficient to code the text without loss.
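The arithmetic behind these figures can be reproduced directly from the definition of the redundancy; the variable names below are illustrative.

```python
import math

h_bits = 1.3                        # Shannon's estimate, bits per letter
m = 27                              # alphabet plus the blank

h_nats = h_bits * math.log(2)       # conversion from bits to nats
redundancy = 1 - h_nats / math.log(m)
effective_alphabet = math.exp(h_nats)

print(round(h_nats, 2), round(redundancy, 2), round(effective_alphabet, 2))
# prints 0.9 0.73 2.46
```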

6.3.1 Example: entropy rates for ordinary Markov chains 

For a regular Markov chain of order 1 with transition matrix $W = (w_{jk})$ and stationary distribution $\pi_j$, one gets

$$h_1 = -\sum_j \pi_j \ln \pi_j \;\ge\; h_2 = h_3 = \cdots = -\sum_j \pi_j \sum_k w_{jk} \ln w_{jk} = h .$$

Identity $h_1 = h$ holds iff $w_{jk} = \pi_k$, that is if the process is of order $r = 0$. Also, $h \to 0$ iff $W$ tends to a permutation, that is iff the process becomes deterministic.
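These relations are easy to verify numerically. The sketch below assumes a regular chain, so that power iteration converges to the stationary distribution; the transition matrix and the function names are hypothetical.

```python
import math

def stationary_distribution(W, n_iter=1000):
    """Stationary distribution by power iteration (assumes a regular chain)."""
    m = len(W)
    pi = [1.0 / m] * m
    for _ in range(n_iter):
        pi = [sum(pi[j] * W[j][k] for j in range(m)) for k in range(m)]
    return pi

def entropy(p):
    """Shannon entropy in nats of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0)

def entropy_rate(W):
    """h = -sum_j pi_j sum_k w_jk ln w_jk, i.e. the pi-weighted row entropies."""
    pi = stationary_distribution(W)
    return sum(pi[j] * entropy(W[j]) for j in range(len(W)))

W = [[0.9, 0.1], [0.4, 0.6]]        # hypothetical two-state chain
pi = stationary_distribution(W)
h1, h = entropy(pi), entropy_rate(W)
print(pi, h1, h)
```

For a matrix with identical rows the two entropies coincide, and for a permutation matrix the entropy rate vanishes.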

6.4 The asymptotic rate for Markov chains 

Under the assumption of a model $f^M$ of order $r$, the probability to observe $D$ is

$$P(D|f^M) \cong \prod_{i=1}^n f^M(x_{i+r}|x_i \cdots x_{i+r-1}) \cong \prod_{\omega \in \Omega} \prod_{\alpha \in \Omega^r} f^M(\omega|\alpha)^{n(\alpha\omega)}$$

where finite "boundary effects", possibly involving the first or last $r$ symbols of the sequence, are here neglected. Also, noting that a total of $\prod_{\alpha \in \Omega^r} \left( n(\alpha)! / \prod_{\omega \in \Omega} n(\alpha\omega)! \right)$ permutations of the sequence generate the same $f^D(\omega|\alpha)$, taking the logarithm and using the Stirling approximation yields the asymptotic rate formula for Markov chains

$$P(f^D|f^M) \cong \exp(-n\, \kappa_{r+1}(f^D\|f^M)) \qquad (28)$$

where

$$\kappa_{r+1}(f\|g) := K_{r+1}(f\|g) - K_r(f\|g) = \sum_{\alpha \in \Omega^r} f(\alpha) \sum_{\omega \in \Omega} f(\omega|\alpha) \ln \frac{f(\omega|\alpha)}{g(\omega|\alpha)} \qquad \text{and} \qquad K_r(f\|g) := \sum_{\alpha \in \Omega^r} f(\alpha) \ln \frac{f(\alpha)}{g(\alpha)} .$$

Setting $r = 0$ returns the asymptotic formula (3) for independence models.



6.5 Testing the order of an empirical sequence 

For $s < r$, write $\alpha \in \Omega^r$ as $\alpha = (\beta\gamma)$ where $\beta \in \Omega^{r-s}$ and $\gamma \in \Omega^s$. Consider $s$-order models of the form $f^M(\omega|\beta\gamma) = f^M(\omega|\gamma)$. It is not difficult to prove the identity

$$\min_{f^M \in \mathcal M_s} \kappa_{r+1}(f^D\|f^M) = -H_{r+1}(f^D) + H_r(f^D) + H_{s+1}(f^D) - H_s(f^D) = h_{s+1}(f^D) - h_{r+1}(f^D) \ge 0 . \qquad (29)$$

As an application, consider, as in section 2.3, the log-likelihood nested test of $H_0$ within $H_1$, opposing $H_0$: "$f^M \in \mathcal M_s$" against $H_1$: "$f^M \in \mathcal M_r$". Identities (28) and (29) lead to the rejection of $H_0$ if

$$2n\, [h_{s+1}(f^D) - h_{r+1}(f^D)] \ge \chi^2_{1-\alpha}[(m-1)(m^r - m^s)] . \qquad (30)$$



6.5.1 Example: test of independence 

For $r = 1$ and $s = 0$, the test (30) amounts to testing independence, and the decision variable

$$h_1(f^D) - h_2(f^D) = H_1(f^D) + H_1(f^D) - H_2(f^D) = H(X_1) + H(X_2) - H(X_1, X_2) = I(X_1 : X_2)$$

is (using stationarity) nothing but the mutual information between two consecutive symbols $X_1$ and $X_2$, as expected from example 3.2.2.
6.5.2 Example: sequential tests



For $r \ge 1$ and $s = r - 1$, inequality (30) implies that the model is at least of order $r$. Setting $r = 1, 2, \ldots, r_{\max}$ (with $df = (m-1)^2 m^{r-1}$) constitutes a sequential procedure permitting to detect the order of the model, if existing.

For instance, a binary Markov chain of order $r = 3$ and length $n = 1024$ in $\Omega = \{a, b\}$ can be simulated as $X_t := g(\frac{1}{4}(Z_t + Z_{t-1} + Z_{t-2} + Z_{t-3}))$, where $Z_t$ are i.i.d. variables uniformly distributed as $\sim U(0,1)$, and $g(z) := a$ if $z \ge \frac{1}{2}$ and $g(z) := b$ if $z < \frac{1}{2}$. Application of the procedure at significance level $\alpha = 0.05$ for $r = 1, \ldots, 5 = r_{\max}$ is summarised in the following table, and is seen to correctly detect the order of the model:



$r$    $h_r(f^D)$    $2n[h_r(f^D) - h_{r+1}(f^D)]$    $df$    $\chi^2_{0.95}[df]$
1      0.692           0.00                              1       3.84
2      0.692           2.05                              2       5.99
3      0.691         110.59                              4       9.49
4      0.637          12.29                              8      15.5
5      0.631          18.02                             16      26.3
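The columns of such a table can be computed from empirical block entropies. The sketch below is illustrative only: the function names are hypothetical, the demonstration sequence is a hypothetical first-order binary chain rather than the simulated chain of the text, and the 0.95 chi-square quantiles are simply copied from the table above instead of being computed.

```python
import math
import random
from collections import Counter

def block_entropy(seq, r):
    """H_r in nats for the empirical r-gram distribution (H_0 = 0)."""
    if r == 0:
        return 0.0
    counts = Counter(seq[i:i + r] for i in range(len(seq) - r + 1))
    total = len(seq) - r + 1
    return -sum(c / total * math.log(c / total) for c in counts.values())

def conditional_entropy(seq, r):
    """h_r = H_r - H_{r-1}."""
    return block_entropy(seq, r) - block_entropy(seq, r - 1)

def sequential_order_test(seq, m, r_max):
    """Rows (r, h_r, 2n[h_r - h_{r+1}], df) for r = 1, ..., r_max."""
    n = len(seq)
    rows = []
    for r in range(1, r_max + 1):
        h_r = conditional_entropy(seq, r)
        stat = 2 * n * (h_r - conditional_entropy(seq, r + 1))
        rows.append((r, h_r, stat, (m - 1) ** 2 * m ** (r - 1)))
    return rows

# 0.95 chi-square quantiles for df = 1, 2, 4, 8, 16, copied from the table.
CHI2_95 = {1: 3.84, 2: 5.99, 4: 9.49, 8: 15.5, 16: 26.3}

# Hypothetical first-order binary chain, used only to exercise the functions.
rng = random.Random(3)
stay = {"a": 0.8, "b": 0.6}         # probability of repeating the current symbol
symbols = ["a"]
for _ in range(1023):
    prev = symbols[-1]
    other = "b" if prev == "a" else "a"
    symbols.append(prev if rng.random() < stay[prev] else other)
seq = "".join(symbols)

for r, h_r, stat, df in sequential_order_test(seq, m=2, r_max=5):
    print(r, round(h_r, 3), round(stat, 2), df, stat >= CHI2_95[df])
```

The last printed column indicates whether the model of order r - 1 would be rejected in favour of order r at the 5% level.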



6.6 Heating and cooling texts 

Let $f(\omega|\alpha)$ (with $\omega \in \Omega$ and $\alpha \in \Omega^r$) denote a conditional distribution of order $r$. In analogy to formula (22) of Statistical Mechanics, the distribution can be "heated" or "cooled" at relative temperature $T = 1/\beta$ to produce the so-called annealed distribution

$$f_\beta(\omega|\alpha) := \frac{f^\beta(\omega|\alpha)}{\sum_{\omega' \in \Omega} f^\beta(\omega'|\alpha)} .$$

Sequences generated with the annealed transitions hence simulate texts possessing a temperature $T$ relative to the original text.
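A minimal sketch of the annealing step follows, assuming the annealed distribution is the normalised power of the original transition probabilities, raised to the exponent 1/T. The function name `anneal` and the toy transitions are hypothetical.

```python
def anneal(transitions, beta):
    """Return the annealed conditional distributions at inverse temperature beta.

    transitions: {context: {symbol: probability}} for an order-r model.
    """
    annealed = {}
    for context, dist in transitions.items():
        # Transitions with zero probability stay impossible at any temperature.
        powered = {s: p ** beta for s, p in dist.items() if p > 0}
        z = sum(powered.values())
        annealed[context] = {s: w / z for s, w in powered.items()}
    return annealed

f = {"th": {"e": 0.8, "a": 0.2}}    # hypothetical order-2 transitions
hot, cold = anneal(f, 0.01), anneal(f, 4)
print(hot, cold)
```

Small values of beta flatten each conditional distribution towards uniformity over the observed transitions, while large values concentrate it on the most probable one.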



6.6.1 Example: simulating hot and cold English texts 

Conditional distributions of order 3, retaining tetragram structure, have been calibrated from Jane Austen's novel Emma (1816), containing $n = 868'945$ tokens belonging to $m = 29$ types (the alphabet, the blank, the hyphen and the apostrophe). A few annealed simulations are shown below, where the first trigram was sampled from the stationary distribution (Bavaud and Xanthos, 2002).

$\beta = 1$ (original process)
f eeliciousnest miss abbon hear jane is arer that isapple did
ther by the withour our the subject relevery that amile 
sament is laugh in ' emma rement on the come februptings he 

$\beta = 0.1$ (10 times hotter)

torables - hantly elterdays doin said just don't check comedina 
inglas ratef usandinite his happerall bet had had habiticents' 
oh young most brothey lostled wife favoicel let you cology 

$\beta = 0.01$ (100 times hotter): any transition having occurred in the original text tends to occur again with uniform probability, making the heated text maximally unpredictable. However, most of the possible transitions did not occur initially, which explains the persistence of the English-like aspect.

et-chaist-temseliving dwelf-ash eignansgranquick-gatef ullied 
georgo namissedeed fessnee th thusestnessf ul-timencurves - 
him duraguesdaird vulgentroneousedatied yelaps isagacity in 

$\beta = 2$ (2 times cooler): conversely, frequent (rare) transitions become even more frequent (rare), making the text fairly predictable.

's good of his compassure is a miss she was she come to the 
of his and as it it was so look of it i do not you with her 
that i am superior the in ther which of that the half - and 

$\beta = 4$ (4 times cooler): in the low temperature limit, the dynamics is trapped in the most probable initial transitions and texts properly become crystal-like, as expected from Physics (see example 4.3.2):

11 the was the was the was the was the was the was the was 
the was the was the was the was the was the was the was the 
was the was the was the was the was the was the was the was 



6.7 Additive and multiplicative text mixtures 

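In the spirit of section 5.1, additive and multiplicative mixtures of two conditional distributions $f(\omega|\alpha)$ and $g(\omega|\alpha)$ of order $r$ can be constructed as

$$h_\lambda(\omega|\alpha) := \lambda f(\omega|\alpha) + (1 - \lambda)\, g(\omega|\alpha) \qquad h_\mu(\omega|\alpha) := \frac{f^\mu(\omega|\alpha)\, g^{1-\mu}(\omega|\alpha)}{\sum_{\omega' \in \Omega} f^\mu(\omega'|\alpha)\, g^{1-\mu}(\omega'|\alpha)}$$

where $0 < \lambda < 1$ and $0 < \mu < 1$. The resulting transition exists if it exists in at least one of the initial distributions (additive mixtures) or in both distributions (multiplicative mixtures).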
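Both constructions can be sketched for a single fixed context, with each conditional distribution stored as a dictionary from symbols to probabilities. The function names and toy distributions are hypothetical, and the multiplicative form assumed here is the normalised weighted geometric mean of the two distributions.

```python
def additive_mixture(f, g, lam):
    """h(w) = lam * f(w) + (1 - lam) * g(w) for one fixed context."""
    symbols = set(f) | set(g)
    return {s: lam * f.get(s, 0.0) + (1 - lam) * g.get(s, 0.0) for s in symbols}

def multiplicative_mixture(f, g, mu):
    """h(w) proportional to f(w)**mu * g(w)**(1 - mu) for one fixed context.

    Returns an empty dict when f and g share no possible transition.
    """
    weights = {s: f[s] ** mu * g[s] ** (1 - mu)
               for s in f if f[s] > 0 and g.get(s, 0.0) > 0}
    z = sum(weights.values())
    return {s: w / z for s, w in weights.items()} if z > 0 else {}

f = {"a": 0.5, "b": 0.5}            # hypothetical transitions after one context
g = {"b": 0.25, "c": 0.75}
add = additive_mixture(f, g, 0.5)
mult = multiplicative_mixture(f, g, 0.5)
print(add, mult)
```

The additive mixture keeps every transition present in either distribution, whereas the multiplicative mixture keeps only the shared ones; an empty result corresponds to a context from which the simulated process cannot continue.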
6.7.1 Example: additive mixture of English and French

Let $g$ denote the empirical distribution of order 3 of example 6.6.1, and define $f$ as the corresponding distribution estimated on the $n = 725'001$ first symbols of the French novel La bête humaine by Émile Zola. Additive simulations with various values of $\lambda$ read (Bavaud and Xanthos, 2002):

$\lambda = 0.17$

11 thin not alarly but alabouthould only to comethey had be 
the sepant a was que lify you i bed at it see othe to had 
state cetter but of i she done a la veil la preckone forma feel 

$\lambda = 0.5$

daband shous ne f indissouservait de sais comment do be certant 
she cette l'ideed se point le fair somethen l'autres jeune suit 
onze muchait satite a ponded was si je lui love toura 

$\lambda = 0.83$

les appelleur voice the toodhould son as or que aprennel un 
revincontait en at on du semblait juge yeux plait etait 
resoinsittairl on in and my she comme elle ecreta-t-il avait 
autes foiser 

showing, as expected, a gradual transformation from English- to French-likeness with increasing $\lambda$.

6.7.2 Example: multiplicative mixture of English and French 

Applied now to multiplicative mixtures, the procedure described in example 6.7.1 yields (Bavaud and Xanthos, 2002)

$\mu = 0.17$

licatellence a promine agement ano ton becol car emm*** ever 
ans touche-***i harriager gonistain ans tole elegards intellan 
enour bellion genea***he succept wa***n instand instilliaristinutes 

$\mu = 0.5$

n neignit innerable quit tole ballassure cause on an une grite 

chambe ner martient infine disable prisages creat mellesselles 

dut***grange accour les norance trop mise une les emm*** 

$\mu = 0.83$

es terine fille son mainternistonsidenter ing sile celles 
tout a pard elevant poingerent une graver dant lesses 
jam***core son luxu***que eles visagemensation lame cendance

where the symbol *** indicates that the process is trapped in a trigram occurring in the English, but not in the French sample (or vice versa). Again, the French-likeness of the texts increases with $\mu$. Interestingly enough, some simulated subsequences are arguably evocative of Latin, whose lexicon contains an important part of the forms common to English and French.

From an inferential point of view, the multiplicative mixture is of the form (12), and hence lies at the boundary of the optimal Neyman-Pearson decision region, governing the asymptotic rate of errors of both kinds, namely confounding French with English or English with French.